INTRODUCTORY STATEMENT



The Center for Social Organization of Schools has two primary
objectives: to develop a scientific knowledge of how schools affect their
students, and to use this knowledge to develop better school practices 
and organization. 

The Center works through five programs to achieve its objectives. 
The Academic Games program has developed simulation games for use in the 
classroom. It is evaluating the effects of games on student learning 
and studying how games can improve interpersonal relations in the schools.
The Social Accounts program is examining how a student's education affects 
his actual occupational attainment, and how education results in different 
vocational outcomes for blacks and whites. The Schools and Maturity 
program is studying the effects of educational experience on a wide range 
of human talents, competencies, and personal dispositions in order to 
formulate -- and research -- important educational goals other than
traditional academic achievement. The School Organization program is 
currently concerned with authority-control structures, task structures,
reward systems, and peer group processes in schools. The Careers and 
Curricula program bases its work upon a theory of career development. 
It has developed a self-administered vocational guidance device and a 
self-directed career program to promote vocational development and to 
foster satisfying curricular decisions for high school, college, and
adult populations. 

This report, prepared by the School Organization Program, discusses 
the applications of output measures in the operation and improvement of
schools, examines some difficulties in obtaining satisfactory measures of 
achievement growth, and outlines a new approach for developing achievement 
scores to analyze educational programs.

ABSTRACT

The single most important output of any school is probably the
magnitude of its students' growth in academic achievement. A variety
of standardized tests have been developed to measure aspects of this
achievement; however, only recently have administrators attempted to
use such tests to help review and make decisions about educational 
programs. This paper describes some examples of these recent applica- 
tions of achievement tests and discusses some of the associated problems. 
In particular, one often unrecognized problem is noted: for these 
program analysis applications, it is necessary to develop a score format 
appropriate to the decision context, and one which has the properties 
of an interval scale. Some difficulties inherent in past attempts to 
develop interval scales of academic achievement are described, and 
several implications of these difficulties are mentioned. Finally, 
the suggestion is made that, with a more open-minded and pragmatic 
approach, research and development work on some of these issues can be 
done rather easily and inexpensively. Such an approach is outlined in 
the concluding section of the paper.

INTRODUCTION



Adequate measurement of the outputs of schools, especially academic 
achievement in reading and arithmetic, is a logically necessary condition
for evaluation research. In addition, however, from a school organization 
point of view, such measurement is a prerequisite if rewards in the school 
are to be distributed responsively and so be instrumental in mobilizing 
motivation. It is also needed if school personnel are to make intelligent 
operating decisions about allocation of resources and to plan for future 
demands.

The first section of this report reviews some of the recent develop-
ments in educational practice and theory which have caused new interest
to be focused upon output measures. It also indicates that, as a conse- 
quence of this recent attention, a number of new problems with the tests 
used to measure academic achievement have come to light. The second 
section discusses some of the history behind these problems and the 
limited applicability of previous work in helping to produce solutions to 
them. In particular, it will be argued that the primary criteria for a 
successful scale are that (1) the scale should measure exactly the variable 
that the user is concerned with in his decision-making process; and (2) 
the scale should have interval properties that are adequately justified. 
In most of the discussions now in the literature which deal with the use 
of achievement tests in educational program analysis, both of these 
criteria have been perceived only in an approximate and vague way. The 
third section of the paper outlines a new approach to the development 
and use of scales of achievement growth for use in educational program
analysis decisions. It is argued that this approach is conceptually
defensible and practically feasible. Although the new approach is a
straightforward one, it leads to some surprising inferences and illuminates
some neglected issues.

This whole topic of achievement tests and their use bristles with a 
number of controversial issues. The present paper will not address itself 
to most of these, even though it is recognized that they are important and
closely related to the issues upon which it does focus. Among the related 
issues which cannot be directly discussed here are: the narrowness of 
content coverage of achievement tests, the degree of cultural bias of 
achievement tests, the validity of these tests in various circumstances, 
the reliability of the tests, the statistical analysis of change scores 
(see, for example, Harris, 1963; Werts and Linn, 1970; Cronbach and Furby, 
1970), or the use of practical tasks, criterion-referenced tests, and 
course grades as output measures. 



The argument of this paper could have been presented differently. The 
situation could have been analyzed by stating that the task confronting 
the educational measurement expert is to find ways to specify a utility 
function for different amounts of educational change. See Melvin Lifson 
(1972) for a clear discussion of this approach. Lifson indicates that 
there are two approaches in general use for developing utility functions. 
The first is the "standard gamble" approach, following the work of
Von Neumann and Morgenstern. The second is the direct magnitude estimation
method developed and used by S. S. Stevens. However, the approach we 
suggest in this paper is different from either of these, and was selected 
to allow us to move from familiar to less familiar ideas. A future report 
will present an explicit comparison of these three approaches to the 
development of a utility scale of achievement outcomes.

THE RECENT INTEREST IN PERFORMANCE MEASURES

Previous work (Cohen and Filipczak, 1971; Kirschenbaum, Napier, and
Simon, 1971) suggests that the ways in which schools as organizations
monitor and reward their participants are, in general, clumsy and unfair.
As a result, schools do not fully utilize an important potential motiva-
tional force. Others (e.g., Owens, 1970) charge that school administration
tends to be either too mechanized (blindly following procedures because 
they exist); or else too intuitive (following personal hunches, and making 
decisions on the basis of inarticulate feelings). 

A number of causal factors operate to inhibit the development of 
responsive and rational school administration. One is the lack of consensus 
about the relative importance of the different objectives a school might
adopt (Stake, 1970). In the absence of such consensus, procedures and 
priorities for achieving the objectives cannot be rationally established. 
Another obstacle is the need to discover and use motivational inducements 
for the staff and students which are effective and feasible. However, even 
if these two sets of problems are dealt with, additional problems arise 
in creating and implementing procedures for actually collecting the infor- 
mation on which to base decisions, and for actually delivering responses 
as intended.

These problems vary in form, depending upon the particular school 
setting and the particular output objective under consideration. At first 
glance, it would seem that many of these problems would be least serious 
for outputs such as reading achievement or mathematics achievement. For 
these outputs, there is a general consensus that they are highly important.
Also, because information about these outputs appears on school transcripts,
it influences the kinds of subsequent opportunities that will be offered
to a child. In other words, there are at least some motivational induce- 
ments structurally attached to the academic achievement of a student. 
Procedurally, data on these academic achievements are collected in two 
ways — by teacher-determined grades, and by standardized tests. 

Few claims are made in the research literature for the accuracy,
precision, or clarity of teacher-determined grades (Warren, 1971;
Donaldson, 1971). Traditionally, grades have been regarded as crude
indicators. In practice, however, they have been generally accepted as
serviceable (particularly when several grades for an individual student
are averaged) as a means of providing an approximate measure of that
student's level of competence. 

Standardized achievement tests are less widely used than grades; 
however, they are quite commonly administered, and the trend in this 
direction seems to be increasing. It is generally felt that standardized 
tests are comparable across a wider variety of school situations. Also, 
they provide a more precise and objective indication of a student's 
level of knowledge than teacher-determined grades. 

However, these standardized tests are subject to a large number of 
limitations. These limitations, as well as the generally undeveloped 
state of the art in the measurement of academic output, have become
apparent recently as a by-product of several large scale efforts to eval- 
uate educational programs. Most of these evaluation efforts have been 
sponsored by the Federal government; they include the Equality of 
Educational Opportunity report (1966), the evaluation of Project Headstart 
(1969), and the report of the experiment in performance contracting (1972). 



These evaluation-research efforts (and numerous other recent studies as 
well) used standardized tests to measure the school's output. Previously,
achievement tests had been used for program review only in much smaller
projects and in the context of purely academic research. The large size,
the high public interest, and the political sensitivity of these recent
efforts all have contributed to the generation of considerable controversy
about their findings and their methods (see, for example, Bowles and
Levin, 1968; Cain and Watts, 1970; Campbell and Erlebacher, 1970; Smith
and Bissell, 1970; Guthrie, 1970). As a result, it has been concluded
that there are a number of difficulties involved in this kind of use of
standardized achievement tests (e.g., Stake, 1967; Popham, 1972;
Campbell and Erlebacher, 1970; Klein, 1971; Coleman and Karweit, 1972).
Under these circumstances, any improvements that can be made to reduce 
these problems will benefit future research aimed at the large-scale 
evaluation of educational programs. 

Obviously, the difficulties faced by researchers and by educational 
administrators are not identical. Various writers (Cohen, 1970; Fennessey,
1972; Rossi and Williams, 1972) have enumerated some of the difficulties
confronting the evaluation researcher. In some instances, these
distinctive problems have been described by the researchers themselves
(e.g., Planar Corporation, 1972). Various other writers have outlined
some of the problems educational administrators face when they consider
the use of output measures. However, the issues faced by each of these
groups are similar enough that a set of procedures aimed to benefit one
would also considerably help the other.



The utility of standardized tests was regarded as limited even prior
to their recent use to evaluate educational programs. More recently, a
number of writers (Dyer, 1971; Lennon, 1971; Rivlin, 1971) have lamented
the extreme emphasis placed upon such tests as the criterion variable.
Some writers have objected that the usual standardized tests encompass
too narrow a domain (Nash and Agne, 1972). In other words, they see the
tests as inadequate indicators of the outcomes and objectives of an educa-
tional program.

More technically, there has been considerable debate over the 
appropriateness of the different available score formats for measuring
academic achievement as part of the quantitative analysis of an educa-
tional program. The grade-equivalent score format in particular has been
severely criticized (Dyer, 1971; Coleman and Karweit, 1972), primarily
because of a property that has been called "fan-spread" (Campbell, 1971).
The issues in this debate have been many and complex. Its content has
been highly technical, and its tone in many cases highly polemic. Yet,
almost all the writers have neglected to consider some really fundamental
points about score uses and score format. This paper suggests two of
these fundamental points. It then indicates their implications for the
future use of standardized achievement tests in connection with decisions
about educational programs, whether these are operating decisions or
research decisions. 



THE MEASUREMENT OF DIFFERENCES IN ACADEMIC PERFORMANCE 



Review of Current Practice in Score Construction

The first basic, but often neglected, point to be made about score
format is that the development of an appropriate score format is possible
only if the decision context for which the measurement is to be used is
made explicit. The second basic point is that such development is
accomplished only when the resulting score can be shown to be an inter-
val scale with respect to the quantity being measured and the comparisons
being made. In less formal words, one must be sure that he is measuring
the correct variable for his purposes and that the measure is strong
enough to support the arithmetic operations required by the usual analysis
techniques (i.e., parametric statistics).

That the choice of score format depends upon the user's purpose is
a point sometimes made in the materials accompanying the currently pub-
lished tests, but this caution is directed only to the context of evalua-
ting the score of an individual student. It has not been generally
recognized that the context of comparing and evaluating educational
programs imposes quite a different set of demands than does using the
scores to locate an individual student.

The second point, that an interval scale is necessary, has gone
equally unrecognized. The simple fact is that if the scores being used
do not form an interval scale with respect to the trait being measured,
then it makes no sense to add them together, or to perform the other
operations of ordinary arithmetic. Without the use of such operations,
almost all the techniques customarily used for analysis cannot be employed
(see Miller and Starr, 1967).

The invisibility of this requirement for interval scores probably
arises because it is mechanically possible to add scores together and
compute, say, an average score. The thrust of the point here, however,
is that if the scale being used is not an interval scale of the intended
trait, then the results of such operations will not mean substantively
what they appear to mean. That is, although two classrooms might have
identical average achievement scores, the true level of achievement in
one class might be quite different from the true level of achievement
in the other class. Conversely, two classes might have average achieve-
ment scores that are quite different, yet their true levels of achieve-
ment might be nearly the same. This phenomenon would be severe to the
extent that the scale being used were not a linear transformation of the
true scale over the range of scores encountered in the two classrooms.
Coleman and Karweit (1972) discuss a related point, namely, the implica-
tions of using scales that are not related to each other linearly. Their
cautions apply equally to any scale that is not linearly related to the
desired underlying trait.
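
The two cases just described can be illustrated with a small numerical
sketch. Everything in it is hypothetical: the square-root reporting scale,
the function name `observed_score`, and the classroom values are assumptions
chosen only to keep the arithmetic easy to follow, and they do not describe
any published test.

```python
import math
from statistics import mean


def observed_score(true_level: float) -> float:
    """Hypothetical reported score: ordinally correct, but not linear in the trait."""
    return math.sqrt(true_level)


# Hypothetical "true" achievement levels for four classrooms of four students each.
class_a = [16.0, 16.0, 16.0, 16.0]
class_b = [1.0, 9.0, 25.0, 49.0]
class_c = [25.0, 25.0, 25.0, 25.0]
class_d = [0.0, 0.0, 36.0, 64.0]


def summarize(true_levels):
    """Return (mean observed score, mean true level) for one classroom."""
    observed = [observed_score(x) for x in true_levels]
    return mean(observed), mean(true_levels)


# Case 1: identical observed averages, different true averages.
obs_a, true_a = summarize(class_a)
obs_b, true_b = summarize(class_b)

# Case 2: different observed averages, identical true averages.
obs_c, true_c = summarize(class_c)
obs_d, true_d = summarize(class_d)

for name, obs, true in [("A", obs_a, true_a), ("B", obs_b, true_b),
                        ("C", obs_c, true_c), ("D", obs_d, true_d)]:
    print(f"class {name}: mean observed = {obs:.2f}, mean true = {true:.2f}")
```

Any monotone but nonlinear scale would show the same effect; the square
root is used here only because it keeps the numbers round.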

A second reason that the necessity for establishing an interval
scale is little understood by most users of achievement test data is that
the publishers of these tests have concentrated primarily on developing
and providing score forms that are (1) simple and intuitively meaningful,
and (2) appropriate for describing the relative academic achievement of
an individual child. Until recently, there has been no corresponding
demand for the development of scores that would be appropriate for other
uses. In the absence of such demand, publishers understandably have been
more or less silent about this limitation in the applicability of their
available scores.

A quick survey of some of the literature on achievement testing 
(S.R.A., 1969; Lindquist and Hieronymus, 1964; Coleman and Karweit, 1972)
reveals that there is no consensus in the field about (1) how to measure
changes (or "growth") over time in academic performance; or (2) how to
compare the academic performance of two groups of students. Only in
some fairly obscure technical publications are these two problems seen
to be basically identical (e.g., E. F. Gardner, 1947). Each can be
reduced simply to a demand for a score format that has the properties
of an interval scale. Strangely enough, little of the available litera-
ture discusses the real problems involved in establishing interval scales
for academic achievement. McNemar (1942) discusses related problems for
the Stanford-Binet intelligence test; but his work is concerned only 
with that particular test, used as a measure of learning potential. The 
few pieces of work that have been done (e.g., Flanagan, 1939) are cited 
in many places, but have not been extended or updated. 

To indicate that the issue is one on which there is little agree- 
ment, it can be noted that three of the most popular tests use three 
different approaches to deal with this set of issues. The Iowa Test of 
Basic Skills is one of the most widely used series of achievement tests. 
The tests were developed and the manuals written by a well-known psycho-
metrician, E. F. Lindquist. The I.T.B.S. manual (Lindquist and
Hieronymus, 1964, page 14) recommends the use of grade-equivalents for 
measuring growth. However, the new Metropolitan Achievement Tests series, 
another widely used battery, offers a special score called "standard
scores" and recommends these standard scores as appropriate for measuring
growth. Yet, the M.A.T. manuals give little information about the deri-
vation of these scores and do not say why they are appropriate for indica-
ting growth. A third approach, preferable at least because it is more
explicit, is exemplified by the Science Research Associates Achievement
Series. This publisher has developed a set of "growth scores" and has
prepared a special manual to explain these scores. This manual is both
readable and thorough, which is no small achievement in itself. Upon
investigation, it seems (cf. Orr, 1972) that the S.R.A. growth scores
are derived by essentially the same procedure used in the derivation of
the Metropolitan Standard Scores. One irritating aspect of this situation
is that the details of the Metropolitan scores derivation procedure are
not described by the publisher in any available written form.

When one publisher recommends grade-equivalents for use in measuring
growth, another rejects grade-equivalents and provides a special but
obscure score for growth, and a third prepares a special set of scores
and devotes a lengthy manual to discussing them, it seems clear that
there is considerable disagreement in the trade about the correct 
procedures for justifying the claim that some set of scores has the 
desired properties. The most interesting point about this situation, 
however, is that the discussion which occurs does not deal with the 
procedures that might be used to develop an adequate scale, but instead 
merely repeats the exhortation to use one or another of the existing 
score forms.

Constructing Interval Scales for Academic Achievement

The key property of an interval scale is that units at any one point 
on the scale are the same in real and relevant magnitude as units at any 
other point on the scale. The important question in this instance is: how 
can we create a scale for academic output that can be shown to have these 
interval properties? 

In creating an interval scale for academic achievement, there is 
no way to establish directly that any given proposed scale has the desired 
properties. That is, the most obvious way to show that a scale is indeed 
an interval scale is to find a physical operation that corresponds to
addition, and a physical relationship that corresponds to equality, and 
then to show that units which are numerically equal on the scale are 
indeed physically equal, and that combinations (sums or differences) of 
units which are numerically equal are in fact physically equal also 
(Coleman, 1964). This direct justification of an interval scale can be 
done quite easily with some physical quantities such as length or weight. 
It cannot be done at all, however, for academic achievement, since for 
this variable there is no physical operation that is physically the 
same as adding the two amounts, nor is there any physical relationship 
corresponding to equality. 

There is a second possible approach for establishing that a given
scale has interval properties, and it is used in physical measurement as
well as in the social sciences. The approach is to adopt a premise,
based upon substantive reasons, that the underlying trait in question
(in this case, academic achievement) has a certain specified frequency
distribution in a certain specified population. Using actual raw scores
from a representative sample of this population, the premise is then ex-
ploited by carrying out transformations of the initial (e.g., raw-score)
scale until the distribution of scores on the resulting transformed 
scale has the same shape as that postulated for the underlying trait. 
This approach is basically an instance of construct validation, but it 
is not the content of the scale that is being validated, but rather its 
metric with respect to the trait. 
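
A minimal sketch of this kind of transformation follows, assuming that the
postulated shape is the standard normal curve. The midpoint percentile-rank
convention and the function name `normalized_scale` are assumptions made for
illustration; the sketch is not the procedure of any particular test
publisher.

```python
from statistics import NormalDist


def normalized_scale(raw_scores):
    """Map each distinct raw score to a value on an approximately normal scale.

    Assumption: each raw score is placed at its midpoint percentile rank
    (share of the sample below it plus half the share tied with it).
    """
    n = len(raw_scores)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    table = {}
    for value in sorted(set(raw_scores)):
        below = sum(1 for s in raw_scores if s < value)
        tied = sum(1 for s in raw_scores if s == value)
        proportion = (below + 0.5 * tied) / n  # always strictly between 0 and 1
        table[value] = standard_normal.inv_cdf(proportion)
    return table


# Hypothetical raw scores from a norming sample.
sample = [3, 5, 5, 8, 8, 8, 9, 20]
for raw, scaled in normalized_scale(sample).items():
    print(f"raw {raw:>2} -> scaled {scaled:+.2f}")
```

The transformation preserves the order of the raw scores and changes only
the spacing between them, which is why it bears on the metric of the scale
and not on its content.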

This distributional approach has been applied successfully to the 
creation of scales for the measurement of intelligence. The premise has 
been adopted, based upon substantive reasoning about the factors which 
determine true intelligence, that in a large, unselected population of 
normal persons, the distribution of the trait "intelligence" will be 
approximately "Gaussian" or normal. If this initial premise is defensible,
then scales derived by using it are also defensible. The intelligence
tests most widely used at present have scale scores developed using this 
line of reasoning, which was first suggested by Thurstone (1925) and 
later refined by Flanagan (1951) and others. 

This same kind of reasoning — beginning with an assumed shape for 
the distribution of the true trait in a specified set of persons — has 
been used by some publishers (e.g., S.R.A.) of achievement tests in their 
efforts to create interval scores. The difficulty with using this 
approach for academic achievement is that there is no compelling reason 
to assume that the distribution of achievement scores has any particular 
shape, much less that it is normal. No matter what population is chosen 
for study, there is bound to be some diversity of educational experience 
that will affect the shape of the resulting distribution of true scores
on academic achievement. In other words, the argument that the trait is
determined by a very large number of statistically independent, indivi- 
dually small causes does not apply nearly as well to academic achievement 
as it does to intelligence. One writer who has recognized this, at 
least partially, and attempted to deal with it, is Eric Gardner. 
Gardner (1950, 1947) advocated the use of a more relaxed distributional 
assumption than normality. His suggestion is that a distribution which 
allows for skewness as well as having a general bell-shape (namely, the 
Pearson Type III) be used. Gardner developed a procedure for creating 
scale scores based upon this distribution. However, while this procedure 
does remove one aspect of the restrictiveness of the normal curve assump-
tion, it does not answer the basic objection -- namely, that it is
unwarranted to posit any particular distributional form. 

There is no intent here to claim that the achievement scales now in 
use are completely unreasonable. On the contrary, it is likely that they 
do reflect the ordinal relations between achievement levels perfectly. 
Moreover, they probably are not extremely distorted, particularly over 
short ranges, from a true interval scale. However, in many situations 
where programs are being evaluated, the distortion might be large 
enough that the difference in scales would make a difference in the final 
result. In other words, although the distortion may be small, even 
small distortions could cloud the basic issue as to the relative effec- 
tiveness of two programs. 

The more general point about the construction of achievement scales
is that such scales ought to be chosen to fit the purpose of the decision
maker who will use them. According to this criterion, several of the
usual scales, including the grade-equivalent scale, are appropriate for
use by educational planners and counselors when they attempt to choose
material for curriculum units or make placement decisions about indivi-
dual students.

For example, many curriculum packages are designed for students 
whose achievement levels fall within a certain range. If a student's 
score is outside this range, then another package would be more appro- 
priate for him, so the decision is simply which curriculum to use with a 
given child. The grade-equivalent scale, regardless of any interval 
properties, is frequently and correctly used in making such decisions. 
In other words, the grade-equivalent score format is well-suited for 
matching an individual student with the curriculum material. 

In a different class of situations, the percentile score can be
quite useful, also without regard to any interval properties it may or 
may not possess. Many educational decisions involve a competitive 
admissions process. In these kinds of decisions, a finite number of 
places are available, and there are more applicants than can be accommo-
dated. To fill the places, the students whose scores are highest are
chosen in order of score until all the available places are filled. For 
this kind of decision, only the ordinal properties of the scale are needed. 
The percentile score presents this ordinal information in a convenient 
and general way, thus simplifying the task of the decision maker and the 
applicant. 

Working from a somewhat different starting point, there has recently 
been a growing movement among testing experts and educators toward what 
are referred to as criterion-referenced tests. The scoring associated
with such tests is designed to relate the child's actual achievement to
a set of real-life tasks. In other words, there is no claim that this 
type of test is particularly useful for the evaluation of educational 
programs, but instead that it locates students directly on dimensions 
that have clear meaning and interest for parents and prospective emplo- 
yers. These criterion-referenced tests, in other words, link an 
individual student's skill level to some common real-world situations. 

For the person who needs to compare two educational programs, how- 
ever, these scales provide little help. There is no claim that these 
score formats provide an interval scale in the context in which they are 
used, but only an ordinal one. Each is aimed primarily at being useful 
in decisions about individual students, whether for further schoolwork 
or for life-work. 

This, then, describes the current situation with regard to achieve- 
ment test score formats and the analysis of educational programs. Before 
suggesting an alternative approach that seems to show some promise, we 
need to examine briefly some implications of the existing state of 
affairs. 

Aside from the ambiguities and controversies that inappropriate
achievement scales create in efforts to evaluate particular educational
programs, an additional consequence is that the confusion and possible
distortion in the scales have aggravated the controversy about various
approaches to the education of low-income children. A variety of topics
relating to racial differences between blacks and whites in learning
rates also are latent in these debates. In fact, it seems quite probable
that these racial issues have motivated far more discussions of testing
than would be apparent on the surface. These discussions, already tense,
frequently become more heated and circular because of the confusion about
test scores and their appropriate use in comparative assessments.

A final implication is that the lack of consensus among testing 
experts has been one major force retarding the introduction of general
outcome monitoring (cf. Blau and Scott, 1962) and accountability programs 
on a routine basis in school operations. The reluctance of teachers and 
administrators to let themselves be measured by a possibly biased instru- 
ment is understandable. What is less understandable is the reluctance 
of many experts to attack this set of technical issues forthrightly and 
empirically. 

A DECISION-THEORETIC APPROACH 

The preceding sections of this paper have indicated that the present 
use of achievement test scores in program review and evaluation is essen- 
tially chaotic. It has been argued here that this chaos arises because 
insufficient attention has been paid to the logical requirements a set 
of scores must possess if they are to be useful for a particular purpose. 
Perhaps the very sophistication of achievement test development and 
norming procedures has made them apparently unassailable, and so in turn 
has made these other requirements less detectable. 

The Objectives of the Approach 

Our point is that, for program review decisions, it is necessary to 
measure specifically the program's impact on the child, not the actual 
level of knowledge of the student. The analyst is interested not in
achievement level, but in change of level. The scale he needs is a scale
of changes, or growth, not a scale of levels. Recognizing this fact, we
see that the scales offered by S.R.A. and the M.A.T. are irrelevant for
comparing programs. They have interval properties, allegedly, in terms
of the actual amount of some knowledge that a student possesses, but this 
is not the concern of program evaluation. 

A second point is that the scores used should have interval proper- 
ties regardless of where growth occurs on the knowledge curve. That is, 
the scales should have interval properties when comparisons are made 
between two students who start at different initial levels, or between 
a student's growth during time interval 1 and his growth during time 
interval 2. The question is how such scores can be developed without 
basing them on the distributional form of construct validation. The 
alternate approach we propose is that of calibrating the new scale for 
achievement growth against a known set of educational programs which have 
equal power. 

Linear Growth 

To describe the general logic of this approach, we first need to 
conceptualize the notion of the "power" of an educational program. This 
power is a quantitative property, so we can imagine two educational 
programs which have equal power. Let us suppose that we have two such 
programs, and we call them A and B. Suppose also that program A deals with 
students in grade 3, and program B deals with students in grade 4. Then, 
if we consider a particular student, and he exerts the same level of 
effort for grade 3 and for grade 4, we would expect that, by definition, 
his growth during grade 3 would be the same as during grade 4. This is 
what is meant by saying that it must be possible to compare a student's 
growth during one time interval with his growth during another time 
interval. Note that, from the point of view of the scale we desire, this 
implies that growth in the scale should be linear over time for any given 
individual. 

More concretely, if Johnny Smith gains 10 units between September of 
grade 3 and June of grade 3, and if we can safely assume that the program 
in grade 4 is equal in power to that in grade 3, then we can expect him 
to grow 10 units between September of grade 4 and June of grade 4. If 
this same equality of program power is assumed for all the grades, then 
the trace of Johnny Smith's level of knowledge from grade 1 to grade 6 
would be linear. It would appear as in Figure 1A. 

Unlike the pattern just described, the growth shown by the M.A.T. 
standard scores or the S.R.A. growth scores is generally like that of 
Figure 1B. With this kind of growth pattern (which presumably does 
reflect the psychometrically true pattern of increase in the trait) there 
is no direct way to compare growth between grade 4 and 5 with growth 
between grade 1 and 2. Thus, these scores are not appropriate for 
comparing the results of two different programs except under very unusual 
conditions. 

[Figure 1B. Horizontal axis: TIME (1 through 6).] 

It should be pointed out that these "unusual" conditions occur 
when the initial scores and the learning rates of the children in program 
A are exactly the same as those in program B, and when there is also no 
differential regression caused by differential matching. To achieve 
these conditions is basically to achieve the classical experimental 
design, in which allocation of individual students to program A or B is 
random. Thus, the point emerges that one way (though probably not often 
practical, if recent past experience is a guide) to circumvent the whole 
dilemma of score format is to use strict randomization of assignment. 

It may have occurred to the reader that the scores described above 
and shown in Figure 1A are not unfamiliar. They are, in fact, exactly 
the grade-equivalent scores already used for other purposes. It is a 
defining property of the ordinary grade-equivalent score that, for the 
reference group on which the scores are based, these scores show a 
perfectly linear growth rate over time. This is an important point and 
makes the grade-equivalent score a strong candidate for use in program 
analysis. 
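
The linearity property can be made concrete with a minimal sketch of how a 
grade-equivalent conversion might be built: raw scores are mapped to grade 
levels by linear interpolation between the median raw scores of a reference 
group. The median values, the table name REFERENCE_MEDIANS, and the function 
name grade_equivalent are hypothetical and chosen only for illustration; the 
procedures actually used by test publishers may differ. 

```python
from bisect import bisect_right

# Hypothetical reference-group medians: (grade level at testing, median raw score).
# These numbers are invented for illustration; they come from no published test.
REFERENCE_MEDIANS = [
    (1.0, 12.0), (2.0, 25.0), (3.0, 35.0),
    (4.0, 43.0), (5.0, 49.0), (6.0, 53.0),
]


def grade_equivalent(raw_score: float) -> float:
    """Map a raw score to a grade equivalent by linear interpolation."""
    grades = [grade for grade, _ in REFERENCE_MEDIANS]
    raws = [raw for _, raw in REFERENCE_MEDIANS]
    # Clamp scores that fall outside the range covered by the reference group.
    if raw_score <= raws[0]:
        return grades[0]
    if raw_score >= raws[-1]:
        return grades[-1]
    i = bisect_right(raws, raw_score) - 1
    fraction = (raw_score - raws[i]) / (raws[i + 1] - raws[i])
    return grades[i] + fraction * (grades[i + 1] - grades[i])


# The raw medians rise by shrinking steps (13, 10, 8, 6, 4 points), yet the
# grade equivalents of those medians rise by exactly one unit per year.
median_ges = [grade_equivalent(raw) for _, raw in REFERENCE_MEDIANS]
yearly_gains = [later - earlier for earlier, later in zip(median_ges, median_ges[1:])]
```

In this sketch the curvature of the raw-score growth pattern is absorbed by 
the conversion table, so that the reference group gains one unit per year by 
construction. 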

Fan-Spread 

Unfortunately, however, the grade-equivalent score lacks another 
property which would be desired in a score used for analysis. To see 
what that desirable property would be, imagine again that we have two 
educational programs whose "power" we know to be equal. Suppose, however, 
that this time both of the programs are designed for use with children 
of the third grade. Suppose also that we apply program A to a group of 
children whose initial achievement level is 2.7, and we apply program B 
to a group of children whose initial achievement level is 3.2. For the 
sake of simplicity, we can even assume that there is only one child 
exposed to each program, or that in each of the groups, every child has 
exactly the same initial score as all the others in his program. Thus, 
we sidestep any arguments about the distribution of scores in each class. 
We would find that the gains shown by the children in program B (in which 
the initial level was higher) were greater than the gains of the children 
in program A. This difference in observed gain would not be due to any 
difference in program power, but instead to the fact that rate of gain in 
score is almost invariably found to be positively correlated with initial 
score when grade-equivalent scores are used. 
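
A small numerical sketch may help fix the idea. It assumes, purely for 
illustration, that each child has gained at a constant rate since entering 
grade 1 at a grade-equivalent of 1.0, and that a program of a given power 
multiplies that rate. Neither assumption comes from the test publishers; the 
function names and the entry level of 1.0 are hypothetical. 

```python
def implied_learning_rate(initial_ge: float, years_in_school: float,
                          entry_ge: float = 1.0) -> float:
    """Grade-equivalent units gained per year so far (assumed constant)."""
    return (initial_ge - entry_ge) / years_in_school


def expected_gain(initial_ge: float, years_in_school: float,
                  program_power: float = 1.0) -> float:
    """One-year gain under a program of the given power (assumed model)."""
    return program_power * implied_learning_rate(initial_ge, years_in_school)


# Two third-grade children (two prior years of school), equal-power programs.
gain_a = expected_gain(2.7, years_in_school=2.0)  # program A, initial level 2.7
gain_b = expected_gain(3.2, years_in_school=2.0)  # program B, initial level 3.2
```

Under these assumptions the child in program A is expected to gain about 
0.85 units and the child in program B about 1.10 units, although the two 
programs were given the same power. 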

Grade-equivalents as a score format have been castigated (Dyer, 1971) 
because they possess this "fan-spread" property. That is, the graphic 
presentation of grade-equivalent scores over time shows that initially 
disadvantaged children (or more accurately, those who initially have a low 
score level) fall progressively further behind each year (see Figure 2). 

A number of educators have rejected this pattern, and likewise the 
score format which produces it, because it suggests that the school is 
denying its impact to those who clearly need it most; it seems to be help- 
ing the "rich get richer and the poor get poorer." Our position is that 
this indictment is unjustified and that, even if it were justified, that 
would be no reason for throwing away the score form. 

The absurdity of this rejection can be seen by comparing the situa- 
tion revealed by grade-equivalents with the fact that in any long-distance 
foot race, the faster runners gradually move farther and farther ahead 
(measured in feet, meters, or inches) as time progresses. Yet, no one 
suggests that our scales of distance be rejected. Rather, what is done 
is to classify racers into approximately equal-speed groups, and then 
compare their performances. To the extent that the level of knowledge 
reached by a child after a length of time depends on his effective 
learning potential, to that extent the differences in growth rate indica- 
ted by grade-equivalents are real, but are totally irrelevant to the 
comparison and analysis of educational programs. 

Using Grade-Equivalents Properly 

Before going on to discuss some possible ways to deal with the fact 
that growth rate is proportional to initial level (because initial level 
serves as an imperfect but good indicator of effective learning potential), 
we need to mention one class of practical situations in which it does not 
represent a problem. In general, there is no reason to think that the 
effective learning potential of a particular student will change between 
time interval 1 and time interval 2. Naturally, concrete evidence would 
make this assumption questionable or even untenable; but in the absence 
of contradictory evidence, it can stand. Thus, regardless of what a 
particular student's individual learning rate is, his growth in terms of 
grade-equivalents during time interval 1 can be compared with his growth 
during time interval 2. Any differences that are observed under these 
conditions are probably the result of differential power of the two 
programs. The same reasoning holds if we have a cohort of students, and 
compare the average of their individual growths during the second time 
interval with the average of their individual growths during the first 
time interval. 

It seems likely that this longitudinal comparison of single children, 
or intact groups of children, is what Lindquist and Hieronymus (1964) had 
in mind when they recommended the grade-equivalent scores as appropriate 
for measuring growth. Since they were writing in the more traditional 
context where the program is regarded as a constant and the question 
concerns the rate of development of the individual child, it is reasonable 
to guess that they were assuming that the program to which the child was 
exposed was the same at the two times. Under that assumption, differences 
between the growth observed during the first time interval and that observed 
during the second would be cause for counseling the child or perhaps compli- 
menting him on an outstanding effort. 

Note too that this discussion brings out the interdependence of 
assumptions here. If the growth in the first time interval is not the same 
as that during the second time interval, it is possible only to say 
something has changed. Whether that change originates in the learning rate 
of the child (or children) or in differential power of the two programs 
is a question that must be settled by examination of the relative plausi- 
bility of these two explanations. In this connection, too, it might be 
wise to recall the cautions noted by Campbell and Stanley (1967) as to the 
possible distorting effect of history on a design of this general sort. 

The practical conclusion that has emerged from the discussion thus 
far is simply the following. For situations in which there is measurement 
on the same children at three or more time points (at least two time 
intervals), it is legitimate to compare the growth (measured as differences 
in grade-equivalents) of individual children who have experienced the 
two programs as a means for comparing the programs. This strategy will 
become increasingly important as we accumulate more and more files of 
good-quality longitudinal data on achievement tests. 
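
The comparison just described can be sketched as follows. The scores are 
hypothetical grade-equivalents for three children measured at three time 
points; the variable and function names are likewise assumptions made for 
illustration. 

```python
from statistics import mean

# Hypothetical grade-equivalent scores for the same children at three time
# points: start of interval 1, end of interval 1, end of interval 2.
scores = {
    "child_1": (2.1, 2.7, 3.6),
    "child_2": (3.0, 4.0, 5.3),
    "child_3": (2.6, 3.4, 4.4),
}


def interval_gains(t1: float, t2: float, t3: float) -> tuple:
    """Return the gain in interval 1 and the gain in interval 2."""
    return t2 - t1, t3 - t2


# Each child serves as his own control: subtract his interval-1 gain from
# his interval-2 gain, then average those differences over the cohort.
differences = []
for t1, t2, t3 in scores.values():
    gain_1, gain_2 = interval_gains(t1, t2, t3)
    differences.append(gain_2 - gain_1)

mean_difference = mean(differences)
```

A positive mean difference would suggest that the program experienced during 
the second interval had greater power, provided each child's learning rate 
can be assumed unchanged between the two intervals. 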

Sub-Groups by Growth Rate 

The next part of our discussion considers the question of an appro- 
priate score format when there is no way to use the child as his own 
control or when strict randomization has not been used. This is the most 
common situation. The approach we suggest for dealing with it is as 
straightforward (in principle, at least) as that used in the athletic 
world. We would simply categorize students into groups which are internally 
homogeneous as far as effective learning rate is concerned, and then make our 
comparisons only within these groups. This is, logically, parallel to the 
typical categories or handicapping systems used in most sports (boxing's 
weight classes, auto racing's classes, golf's handicaps, etc.). 

The difficulty arises when we try to make this classification in 
practice. It is in this area that we most need empirical work and dissem- 
ination of results to provide a general pool of benchmark information for 
all researchers. Some first steps in this direction are apparent. For 
example, in several recent reports by researchers dealing with programs 
for improving the performance of disadvantaged children (Donaldson, 1971; 
U. S. Office of Education, 1972), there are statements to the effect that 
the "normal" or "expected" gain rate for disadvantaged children is about 
0.7 grade-equivalent units per year. Thus, there has been an informal 
and crude partitioning of the gain rates into two categories (ordinary, 
and disadvantaged). The major problem with this particular classification 
scheme is that it is still extremely crude. 

In fact, it is cruder than it need be for most studies. During the 
past year, this author was involved as a consultant and analyst on a 
research project dealing with disadvantaged children. The project, 
sponsored by the U. S. Office of Education, examined the feasibility and 
impact of offering monetary incentives to teachers and parents to improve 
school effectiveness. In the course of that project, we needed to 
calculate "expected" gains for each child in the project in order to 
determine whether individual teachers would receive cash bonuses at the 
end of the year. Thus, the field workers on this study needed to calculate 
a set of expected gains which would be perceived as fair, not just by 
researchers in an academic context, but by real teachers for whom the 
gains would imply payment or lack of it. 

Because the project dealt only with schools containing a large pro- 
portion of severely disadvantaged children, it might be thought that the 
typical gain rate of 0.7 would be a reasonable number for use in all 
classes. However, in a number of the schools and grades, the students 
were ability-grouped, which meant that one teacher might teach only the 
relatively slow students in that school while another teacher might teach 
only the relatively bright children. Researchers and teachers quickly 
and independently arrived at the conclusion that a more refined and 
specific set of expected gains was needed. 

To provide these more specific benchmarks, the project workers used 
an approximation that seemed the best available under the circumstances. 
This approximation was developed by calculating the cohort-to-cohort 
difference between each adjacent grade in each school, separately for the 
upper third, middle third, and lower third, of all students in the school 
at that grade (Planar Corporation, 1972). Clearly, this procedure 
involves some assumptions that can be questioned. On the other hand, it 
is equally clear that it provides a substantial improvement in precision 
of prediction as compared to using a single number such as 0.7. As a 
matter of fact, the calculated gains to be expected ranged from about 0.2 
to about 1.3 grade-equivalents per class; and, for those students in the 
middle third of the ranking, were usually not far from the 0.7 used in the 
other studies. 
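
The cohort-to-cohort approximation can be sketched as below. This is one 
plausible reading of the procedure, not a reproduction of the project's 
actual computation: it assumes that the mean within each third is the 
statistic compared, and the scores, grades, and function names are all 
hypothetical. 

```python
from statistics import mean


def thirds(scores):
    """Split scores into lower, middle, and upper thirds by rank.

    Needs at least three scores so that no third is empty.
    """
    ordered = sorted(scores)
    n = len(ordered)
    cut_1, cut_2 = n // 3, (2 * n) // 3
    return {
        "lower": ordered[:cut_1],
        "middle": ordered[cut_1:cut_2],
        "upper": ordered[cut_2:],
    }


def expected_gains(lower_grade_scores, upper_grade_scores):
    """Cohort-to-cohort difference between adjacent grades, by third."""
    low = thirds(lower_grade_scores)
    high = thirds(upper_grade_scores)
    return {name: mean(high[name]) - mean(low[name]) for name in low}


# Hypothetical grade-equivalent scores from one school, tested at one time.
grade_3 = [1.8, 2.0, 2.2, 2.5, 2.7, 2.9, 3.2, 3.4, 3.6]
grade_4 = [2.1, 2.3, 2.5, 3.2, 3.4, 3.6, 4.4, 4.7, 5.0]

benchmarks = expected_gains(grade_3, grade_4)
```

With these invented scores the expected one-year gains come to roughly 0.3, 
0.7, and 1.3 grade-equivalents for the lower, middle, and upper thirds. The 
sketch treats this year's fourth graders as a stand-in for what this year's 
third graders will look like a year later, which is the questionable 
assumption noted in the text. 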

This solution, developed hastily under the pressure of real deadlines, 
stands as a sensible compromise among a variety of conflicting criteria. 
The particular characteristics of the solution adopted in the Incentives 
Project are less important than the kinds of thought processes it reflects. 
In that project, there was a practical need to obtain expected gain 
estimates that would be as precise as possible, yet these estimates had 
to be provided within narrow constraints of time and money. In this 
situation, and in view of the absence of available gains data on similar 
populations to provide distributional information, there was reliance on 
a direct approach using cross-sectional data on the project schools 
as a substitute for the actual gains data. Evidently, the approximations 
were adequate, as is indicated by the fact that the subsequent post-test 
gains actually obtained tended to be distributed fairly closely around 
the predicted values, and were not systematically higher or lower for 
teachers regardless of the ability level of students they taught. 

SUMMARY AND CONCLUSIONS 

This paper has argued that a good deal of the recent controversy 
surrounding the uses of standardized tests is unnecessary; indeed, this 
controversy distracts attention from other related problems on which 
work is needed and possible. The wide interest in measures of achievement 
arises primarily because several recent large-scale evaluations of 
educational programs have made use of standardized achievement tests as if 
they provided interval scales for the variables of interest. In fact, 
most of the scales commonly used with standardized tests were not 
designed as interval scales. There are, however, some special scales 
offered by publishers with the claim that they have interval properties. 
These latter scales are justified only by a fairly weak argument, that is, 
by the appeal to a normality of distributions which may not be the actual 
situation. More importantly, the scales so proposed, even if they are 
accepted as being what they claim, are demonstrably not interval scales 
for the variable which is of interest in program evaluation, namely, 
program impact. Only a scale that yields equal changes when any child 
is exposed to a program with a fixed "power" can meet this criterion. 

Of the generally familiar scales, the one that seems most adaptable 
for this purpose is the grade-equivalent scale. This scale has been 
mistakenly attacked in recent years, because it seems to show patterns 
that some persons find threatening. The fact is, however, that this 
scale does have one of the two desirable properties needed in any scale 
for program analysis: it yields linear growth for an individual child. 
Therefore, in situations where there is comparison of the same person's 
reaction to two programs, or where strict randomized assignment has been 
used, the grade-equivalent scale is perfectly appropriate as a basis for 
calculating gains. 

For those more frequent situations in which there is a comparison 
of two non-equivalent groups, a problem called "fan-spread" enters the 
picture. One very sensible and practical procedure that meets the 
problem of "fan-spread" is to stratify the population under examination 
into a number of subgroups according to the best available indicator of 
their effective learning potential. In many cases, the best available 
indicator will be the initial score, but other kinds of data can be used 
as substitutes or supplements to the initial scores. Once these subgroups 
have been defined, their expected rate of change can be estimated by 
looking at the observed rate of change for similar groups under 
(presumably) similar conditions. 

The practicality of this approach was demonstrated in the recent 
Incentives in Education project. For that project, cross-sectional data 
based on the same schools and the same students were used to create the 
benchmark. However, there is a wide variety of possibilities for creating 
these benchmarks. Thus, this approach, analogous to the calibration of 
a physical scale, provides a practical and flexible way to develop the 
kinds of scales that we need to analyze educational programs adequately. 

There is no short-cut, general solution to the problem of developing 
benchmarks for a variety of situations, but there are direct and feasible 
ways in which progress can be made. One useful activity would be to 
compile tabulations of the distributions of observed gains in achievement 
test scores under various conditions. There are a number of data files 
from which such tabulations could be made without enormous effort. The 
material for this sort of tabulation exists not only in the files of 
several large scale research projects, but also in the files of several 
large school districts which administer standardized tests routinely. 
Work of this sort is underway now at Hopkins and elsewhere. As results 
from this sort of work accumulate, individual investigators will be less 
confined by the limitations of their own data and their own budgets in 
setting benchmarks. 

As already mentioned, there are a variety of specific approaches 
which might be considered in developing the benchmark gains for the 
stratified grade-equivalent scales of program impact. One specific 
approach is illustrated in the Yardstick system (Pinkham, 1970), but 
others are possible as well. Future research in this area will provide 
information about the advantages and disadvantages of different approaches 
and the relationship between their results. This work too is feasible and 
important, but in some cases will require the collection of richer data 
than is presently available. 